2021WI_MSDS_430-DL_SEC58 Python for Data Analysis
GROUP 2 Gaurav Garg, Melissa Gillette, Mark Stockwell
Description of Dataset Source:
COVID DATA - The CSV file contains daily data on the COVID-19 pandemic for the US and individual states. All dates and times are in US eastern time (ET).
Data Source Summary: Almost all of the data compiled is taken directly from the websites of state/territory public health authorities. Where data is missing from these websites, it is supplemented with information from official state social media accounts or from press conferences with governors or other state authorities.
Each state has its own set of caveats, which are documented on the project site's data page.
Link to Dataset:
This data is regularly tracked, maintained and analyzed by a variety of institutions and governments across the globe. The data is extracted and collated directly from public health authorities and hence can be independently verified.
The dataset provides clear definitions for each field, which supports clarity and focus.
Many data sources, like the JHU (Johns Hopkins University) tracker and the New York Times Coronavirus in the US tracker, focus on case counts (positive tests) and deaths. This project includes more complete testing data in an attempt to learn how many tests the US is conducting and how many tests individual states and territories are conducting.
Over the last year it has proven to be a reliable source of information for tracking COVID-related testing, infections, and deaths across the US.
Goals
The overall goal of this project is to understand the relationship and timing of COVID testing, hospitalizations, ventilator use, and deaths. It is believed these statistics are related and can be used in a predictive way for hospital planning purposes.
# Import File
import pandas as pd
df = pd.read_csv('https://covidtracking.com/data/download/all-states-history.csv')
#df = pd.read_excel("COVID all-states-history.xlsx")
# Analyze File
# TODO: Use the info() method to determine to inspect the variable (column) names, the number of non-null values,
# and the data types for each variable.
df.info()
# TODO: Use the head() method to inspect the first five (or more) rows of the data
df.head()
# TODO: Use the tail() method to inspect the last five (or more) rows of the data
df.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20780 entries, 0 to 20779
Data columns (total 41 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   date                              20780 non-null  object 
 1   state                             20780 non-null  object 
 2   death                             19909 non-null  float64
 3   deathConfirmed                    9464 non-null   float64
 4   deathIncrease                     20780 non-null  int64  
 5   deathProbable                     7573 non-null   float64
 6   hospitalized                      12891 non-null  float64
 7   hospitalizedCumulative            12891 non-null  float64
 8   hospitalizedCurrently             17338 non-null  float64
 9   hospitalizedIncrease              20780 non-null  int64  
 10  inIcuCumulative                   3792 non-null   float64
 11  inIcuCurrently                    11534 non-null  float64
 12  negative                          13274 non-null  float64
 13  negativeIncrease                  20780 non-null  int64  
 14  negativeTestsAntibody             1541 non-null   float64
 15  negativeTestsPeopleAntibody       972 non-null    float64
 16  negativeTestsViral                5057 non-null   float64
 17  onVentilatorCumulative            1293 non-null   float64
 18  onVentilatorCurrently             9122 non-null   float64
 19  positive                          20584 non-null  float64
 20  positiveCasesViral                14014 non-null  float64
 21  positiveIncrease                  20780 non-null  int64  
 22  positiveScore                     20780 non-null  int64  
 23  positiveTestsAntibody             3430 non-null   float64
 24  positiveTestsAntigen              2233 non-null   float64
 25  positiveTestsPeopleAntibody       1094 non-null   float64
 26  positiveTestsPeopleAntigen        633 non-null    float64
 27  positiveTestsViral                8991 non-null   float64
 28  recovered                         12002 non-null  float64
 29  totalTestEncountersViral          5231 non-null   float64
 30  totalTestEncountersViralIncrease  20780 non-null  int64  
 31  totalTestResults                  20614 non-null  float64
 32  totalTestResultsIncrease          20780 non-null  int64  
 33  totalTestsAntibody                4504 non-null   float64
 34  totalTestsAntigen                 3421 non-null   float64
 35  totalTestsPeopleAntibody          2200 non-null   float64
 36  totalTestsPeopleAntigen           999 non-null    float64
 37  totalTestsPeopleViral             9228 non-null   float64
 38  totalTestsPeopleViralIncrease     20780 non-null  int64  
 39  totalTestsViral                   14469 non-null  float64
 40  totalTestsViralIncrease           20780 non-null  int64  
dtypes: float64(30), int64(9), object(2)
memory usage: 6.5+ MB
| date | state | death | deathConfirmed | deathIncrease | deathProbable | hospitalized | hospitalizedCumulative | hospitalizedCurrently | hospitalizedIncrease | ... | totalTestResults | totalTestResultsIncrease | totalTestsAntibody | totalTestsAntigen | totalTestsPeopleAntibody | totalTestsPeopleAntigen | totalTestsPeopleViral | totalTestsPeopleViralIncrease | totalTestsViral | totalTestsViralIncrease | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20775 | 2020-01-17 | WA | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | ... | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 20776 | 2020-01-16 | WA | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | ... | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 20777 | 2020-01-15 | WA | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | ... | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 20778 | 2020-01-14 | WA | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | ... | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 20779 | 2020-01-13 | WA | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | ... | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
5 rows × 41 columns
We hope to find answers to the following questions:
1) Is there a data correlation (lead and lag indicators) that can help predict future hospitalizations? The benefit of this analysis would be enabling better hospital capacity planning.
2) Is there a correlation between staff experience with the disease over time and decreased ventilator use and deaths? This analysis would stress the importance of training staff at an accelerated rate (particularly on the latest treatment protocols) to ultimately decrease the death rate. It would also help planning for future pandemic response by identifying metrics which, once reached, should be followed by improvements in disease outcomes.
Our hunch is that an increase in the positive test ratio correlates with future hospitalizations: regardless of the absolute number of tests, the positive test ratio is a proxy for overall disease prevalence in the population. Our other hunch is that with greater disease experience and the evolution of treatments, the ratios of ventilator usage and deaths to total hospitalizations will decrease. In other words, greater insight into the disease combined with effective treatment options will lower the percentage of patients needing ventilators and dying.
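The first hunch can be made concrete as a daily positive-test ratio. A minimal sketch on a toy frame (the column names match the dataset; the numbers are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking two columns of the COVID tracking data
# (column names match the real dataset; the values are made up)
toy = pd.DataFrame({
    'positiveIncrease':         [100, 150, 200],
    'totalTestResultsIncrease': [1000, 1000, 1000],
})

# Daily positive-test ratio: new positives / new test results reported that day
toy['positiveRatio'] = toy['positiveIncrease'] / toy['totalTestResultsIncrease']
print(toy['positiveRatio'].tolist())  # → [0.1, 0.15, 0.2]
```

On the real data the same division applies per state per day, though days with zero reported tests would need to be guarded against division by zero.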
#Aggregate the Data by State and Confirmed Deaths
df_death_state = df['deathIncrease'].groupby(df['state']).agg(['count','sum'])
df_death_state.sort_values(by='sum',ascending=False)[:50]
| count | sum | |
|---|---|---|
| state | ||
| CA | 369 | 54124 |
| TX | 370 | 44451 |
| NY | 371 | 39029 |
| FL | 404 | 32266 |
| PA | 370 | 24349 |
| NJ | 392 | 23574 |
| IL | 369 | 23014 |
| GA | 369 | 17906 |
| OH | 368 | 17656 |
| MI | 372 | 16658 |
| MA | 411 | 16417 |
| AZ | 369 | 16328 |
| IN | 375 | 12737 |
| TN | 368 | 11543 |
| NC | 369 | 11502 |
| AL | 366 | 10148 |
| LA | 366 | 9748 |
| VA | 406 | 9596 |
| SC | 369 | 8754 |
| MO | 366 | 8161 |
| MD | 368 | 7955 |
| CT | 373 | 7704 |
| WI | 370 | 7106 |
| MS | 366 | 6808 |
| MN | 367 | 6550 |
| CO | 369 | 5989 |
| IA | 367 | 5558 |
| AR | 367 | 5319 |
| WA | 420 | 5041 |
| NV | 368 | 5037 |
| KY | 367 | 4819 |
| KS | 367 | 4812 |
| OK | 366 | 4534 |
| NM | 368 | 3808 |
| RI | 372 | 2541 |
| WV | 367 | 2325 |
| OR | 369 | 2296 |
| NE | 387 | 2113 |
| PR | 357 | 2059 |
| UT | 366 | 1976 |
| SD | 366 | 1900 |
| ID | 366 | 1879 |
| ND | 366 | 1478 |
| DE | 367 | 1473 |
| MT | 366 | 1381 |
| NH | 369 | 1184 |
| DC | 368 | 1030 |
| ME | 366 | 706 |
| WY | 372 | 682 |
| HI | 369 | 445 |
import plotly.graph_objects as go
df_death_state.reset_index(inplace=True)
fig = go.Figure(data=go.Choropleth(
locations=df_death_state['state'], # Spatial coordinates
z = df_death_state['sum'].astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Deaths",
))
fig.update_layout(
title_text = 'COVID Death Rate',
geo_scope='usa', # limit map scope to USA
)
#fig.show()
#Identify the sample states that will be analyzed for our project based on severity of COVID
#Washington state is included as it reported the first COVID case and the first COVID fatality in the US
sample_state_list = ['NY','OH','MI','PA','CA','TX','FL','WA']
df_sample = df[df['state'].isin(sample_state_list)]
df_sample.shape
(3044, 41)
The population represented is based on COVID-related data captured across all 50 US states, the District of Columbia, and five territories (American Samoa, the US Virgin Islands, Puerto Rico, the Northern Mariana Islands, and Guam).
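That jurisdiction count can be checked programmatically with `nunique()`; a minimal sketch on a toy frame (the real call would be `df['state'].nunique()` on the loaded dataset):

```python
import pandas as pd

# Toy stand-in for df: one row per report, 'state' holds the jurisdiction code
toy = pd.DataFrame({'state': ['NY', 'NY', 'WA', 'PR', 'GU']})

# nunique() counts distinct jurisdiction codes; on the full dataset the same
# call is expected to return 56 (50 states, DC, and five territories)
print(toy['state'].nunique())  # → 4
```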

print('The total sample size has {} Rows and {} Columns.'.format(df.shape[0],df.shape[1]))
The total sample size has 20780 Rows and 41 Columns.
However, in our analysis we will focus on key states (seven of the top ten ranked by total deaths) and the State of Washington, where the first COVID-linked infection and fatality in the US were reported.
print('After filtering the dataset for these states, the overall sample dataset has {} rows.'.format(df_sample.shape[0]))
After filtering the dataset for these states, the overall sample dataset has 3044 rows.
#df_death_state.info()
df_death_state_sample = df_death_state[df_death_state['state'].isin(sample_state_list)]
df_death_state_sample
| state | count | sum | |
|---|---|---|---|
| 5 | CA | 369 | 54124 |
| 10 | FL | 404 | 32266 |
| 24 | MI | 372 | 16658 |
| 37 | NY | 371 | 39029 |
| 38 | OH | 368 | 17656 |
| 41 | PA | 370 | 24349 |
| 47 | TX | 370 | 44451 |
| 52 | WA | 420 | 5041 |
Chart - Breakdown of Deaths per State and the Count of Rows Available.
Based on the initial data set, there are 41 variables. The info() command provides detailed technical metadata, specifically the data types of each column (most are numeric). The dataset includes column headers with descriptive names, which are generally helpful in identifying the meaning of each variable. In addition, there are multiple sources of business/semantic metadata for each column: a published spreadsheet and a JSON API.
The spreadsheet and API can be read programmatically and linked to the dataset metadata to produce a consolidated data dictionary. The code below reads each source, joins them in a pandas data frame, and then exports to an Excel file for easy viewing.
import pandas as pd
import json
import requests
# Check for original data, load if not there. The check avoids re-downloading during development
if "df" not in globals():
df = pd.read_csv('https://covidtracking.com/data/download/all-states-history.csv')
# Download the JSON data from API as text.
# Note only one day needed since field definitions are included each day
json_data = json.loads(requests.get('https://api.covidtracking.com/v2beta/us/daily/2021-01-02.json').text)
#show the keys in the json data, 'meta' is tag we are looking for
print('json keys:',list(json_data.keys()));
# Extract the meta section from the JSON to simplify the extraction of field definitions
df_meta = pd.DataFrame(json_data["meta"])
json_meta = pd.json_normalize(df_meta.field_definitions)
json_meta = json_meta.convert_dtypes()  # assign the result; convert_dtypes() returns a new frame
# json_meta.prior_names is a list, but can be empty. Add a column taking first val
json_meta['Column'] = json_meta.prior_names.apply(lambda x: x[0] if len(x)>0 else None)
# columns 'deprecated' and 'prior_names' no longer needed. Rename remaining cols
json_meta.drop(columns=['prior_names','deprecated'], inplace=True)
json_meta.columns = ['json_name','json_field','Column']
# Print output
json_meta.info();
print('#'*30, ' JSON METADATA BELOW ', '#'*30)
json_meta
json keys: ['meta', 'data']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   json_name   16 non-null     object
 1   json_field  14 non-null     object
 2   Column      14 non-null     object
dtypes: object(3)
memory usage: 512.0+ bytes
############################## JSON METADATA BELOW ##############################
| json_name | json_field | Column | |
|---|---|---|---|
| 0 | Total test results | tests.pcr.total | totalTestResults |
| 1 | Hospital discharges | NaN | None |
| 2 | Confirmed Cases | cases.confirmed | positiveCasesViral |
| 3 | Cumulative hospitalized/Ever hospitalized | outcomes.hospitalized.total | hospitalizedCumulative |
| 4 | Cumulative in ICU/Ever in ICU | outcomes.hospitalized.in_icu | inIcuCumulative |
| 5 | Cumulative on ventilator/Ever on ventilator | hospitalization.on_ventilator.cumulative | onVentilatorCumulative |
| 6 | Currently hospitalized/Now hospitalized | hospitalization.hospitalized.currently | hospitalizedCurrently |
| 7 | Currently in ICU/Now in ICU | hospitalization.in_icu.currently | inIcuCurrently |
| 8 | Currently on ventilator/Now on ventilator | hospitalization.on_ventilator.currently | onVentilatorCurrently |
| 9 | Deaths (probable) | outcomes.death.probable | deathProbable |
| 10 | Deaths (confirmed) | outcomes.death.confirmed | deathConfirmed |
| 11 | Deaths (confirmed and probable) | outcomes.death.cumulative | death |
| 12 | Probable Cases | NaN | None |
| 13 | Last Update (ET) | meta.updated | lastUpdateEt |
| 14 | New deaths | outcomes.death.calculated.change_from_prior_day | deathIncrease |
| 15 | Date | date | dateModified |
import pandas as pd
import json
import requests
# Get metadata from https://docs.google.com/spreadsheets/d/e/2PACX-1vRpgAzsSuN9BGtPv8DXMoct3lIRSpmHeFzzI0J7Sc1Ym68RP89QTSFJp4UBFuPMxpdlHpAGKBOtLHIG/pubhtml#
df_sheet = pd.read_html('https://docs.google.com/spreadsheets/d/e/2PACX-1vRpgAzsSuN9BGtPv8DXMoct3lIRSpmHeFzzI0J7Sc1Ym68RP89QTSFJp4UBFuPMxpdlHpAGKBOtLHIG/pubhtml#',
match='API Name', header=3, skiprows=0)
# read_html returns a list of dataframes; take the first one
# https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-html
pd.options.display.width=None
sheet_meta = df_sheet[0]
# drop first col, it's just a row num
sheet_meta.drop(sheet_meta.columns[0], axis=1, inplace=True)
# Rename the "API Name" to "Column" to match other metadata
sheet_meta.rename(columns={"API Name": "Column"}, inplace=True)
# print output
print('#'*30, ' SHEET METADATA BELOW ', '#'*30)
sheet_meta
############################## SHEET METADATA BELOW ##############################
| Column | Website Name | Source Notes Name | |
|---|---|---|---|
| 0 | death | Deaths (confirmed and probable) | Deaths (confirmed and probable) |
| 1 | deathConfirmed | Deaths (confirmed) | Deaths (confirmed) |
| 2 | deathProbable | Deaths (probable) | Deaths (probable) |
| 3 | hospitalizedCumulative | Cumulative hospitalized | Cumulative Hospitalized |
| 4 | hospitalizedCurrently | Currently hospitalized | Currently Hospitalized |
| 5 | hospitalizedDischarged | Hospital discharges | Hospital Discharges |
| 6 | inIcuCumulative | Cumulative in ICU | Cumulative in ICU |
| 7 | inIcuCurrently | Currently in ICU | Currently in ICU |
| 8 | lastUpdateEt | Last Update (ET) | Local Time |
| 9 | negative | Negative PCR tests (people) | Negative (People or Cases) |
| 10 | negativeTestsAntibody | Negative antibody tests (specimens) | Negative Antibody Tests |
| 11 | negativeTestsAntigen | Negative antigen tests (specimens) | Negative Antigen Tests |
| 12 | negativeTestsPeopleAntibody | Negative antibody tests (people) | Negative Antibody Tests (People) |
| 13 | negativeTestsPeopleAntigen | Negative antigen tests (people) | Negative Antigen Tests (People) |
| 14 | negativeTestsViral | Negative PCR tests (people) | Negative Tests (PCR) |
| 15 | onVentilatorCumulative | Cumulative on Ventilator | Cumulative on Ventilator |
| 16 | onVentilatorCurrently | Currently on Ventilator | Currently on Ventilator |
| 17 | pending | Pending | Pending |
| 18 | positive | Cases (confirmed plus probable) | Positive Cases (People, confirmed + probable) |
| 19 | positiveCasesViral | Confirmed cases | Positive Cases (PCR) |
| 20 | positiveTestsAntibody | Positive antibody tests (specimens) | Positive Antibody Test |
| 21 | positiveTestsAntigen | Positive antigen tests (specimens) | Positive Antigen Tests |
| 22 | positiveTestsPeopleAntibody | Positive antibody tests (people) | Positive Antibody Tests (People) |
| 23 | positiveTestsPeopleAntigen | Positive antigen tests (people) | Positive Antigen Tests (People) |
| 24 | positiveTestsViral | Positive PCR tests (specimens) | Positive Tests (PCR) |
| 25 | probableCases | Probable cases | Probable Cases |
| 26 | recovered | Recovered | Recovered |
| 27 | totalTestsEncountersVIral | Total PCR tests (test encounters) | Total Test Encounters (PCR) |
| 28 | totalTestsAntibody | Total antibody tests (specimens) | Total Antibody Tests |
| 29 | totalTestsAntigen | Total antigen tests (specimens) | Total Antigen Tests |
| 30 | totalTestsPeopleAntibody | Total antibody tests (people) | Total Antibody Tests (People) |
| 31 | totalTestsPeopleAntigen | Total antigen tests (people) | Total Antigen Tests (People) |
| 32 | totalTestsPeopleViral | Total PCR tests (people) | Total PCR Tests (People) |
| 33 | totalTestsViral | Total PCR tests (specimens) | Total Tests (PCR) |
import pandas as pd
# Join the sheet and json
meta2 = pd.merge(sheet_meta, json_meta, on='Column', how='outer')
# Get columndata from source dataframe
# The describe command has great info but the cols are column names.
# Transpose used to swap the cols with rows
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
meta_source = df.describe().transpose()
meta_source = meta_source.convert_dtypes()  # assign the result; convert_dtypes() returns a new frame
# this creates column called "index" which has column names
meta_source.reset_index(inplace=True)
meta_source.rename(columns={"index": "Column"}, inplace=True)
# Get the data types of each col similar to info() command
df_types = pd.DataFrame(df.dtypes)
# rename column 1 to something meaningful
df_types.columns = ['dtype']
# this creates column called "index" which has column names
df_types.reset_index(inplace=True)
df_types.rename(columns={"index": "Column"}, inplace=True)
# merge with source
data_dict = pd.merge(df_types, meta_source, on='Column', how='left')
data_dict = pd.merge(data_dict, meta2, on='Column', how='left')
data_dict.to_excel('DataDictionary.xlsx')
data_dict
| Column | dtype | count | mean | std | min | 25% | 50% | 75% | max | Website Name | Source Notes Name | json_name | json_field | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | date | object | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | state | object | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | death | float64 | 19909.0 | 3.686049e+03 | 6.283560e+03 | 0.0 | 163.00 | 1111.0 | 4390.00 | 54124.0 | Deaths (confirmed and probable) | Deaths (confirmed and probable) | Deaths (confirmed and probable) | outcomes.death.cumulative |
| 3 | deathConfirmed | float64 | 9464.0 | 3.757161e+03 | 4.153410e+03 | 0.0 | 600.00 | 2394.5 | 5443.25 | 21177.0 | Deaths (confirmed) | Deaths (confirmed) | Deaths (confirmed) | outcomes.death.confirmed |
| 4 | deathIncrease | int64 | 20780.0 | 2.479028e+01 | 6.016152e+01 | -201.0 | 0.00 | 6.0 | 24.00 | 2559.0 | NaN | NaN | New deaths | outcomes.death.calculated.change_from_prior_day |
| 5 | deathProbable | float64 | 7573.0 | 4.182385e+02 | 5.380187e+02 | 0.0 | 80.00 | 217.0 | 461.00 | 2594.0 | Deaths (probable) | Deaths (probable) | Deaths (probable) | outcomes.death.probable |
| 6 | hospitalized | float64 | 12891.0 | 1.131988e+04 | 1.760722e+04 | 1.0 | 1039.00 | 4965.0 | 13018.50 | 89995.0 | NaN | NaN | NaN | NaN |
| 7 | hospitalizedCumulative | float64 | 12891.0 | 1.131988e+04 | 1.760722e+04 | 1.0 | 1039.00 | 4965.0 | 13018.50 | 89995.0 | Cumulative hospitalized | Cumulative Hospitalized | Cumulative hospitalized/Ever hospitalized | outcomes.hospitalized.total |
| 8 | hospitalizedCurrently | float64 | 17338.0 | 1.190639e+03 | 2.060080e+03 | 0.0 | 167.00 | 531.0 | 1279.00 | 22851.0 | Currently hospitalized | Currently Hospitalized | Currently hospitalized/Now hospitalized | hospitalization.hospitalized.currently |
| 9 | hospitalizedIncrease | int64 | 20780.0 | 4.228147e+01 | 2.102032e+02 | -4124.0 | 0.00 | 0.0 | 37.00 | 16373.0 | NaN | NaN | NaN | NaN |
| 10 | inIcuCumulative | float64 | 3792.0 | 1.932687e+03 | 1.953288e+03 | 6.0 | 499.25 | 1294.0 | 2449.50 | 9263.0 | Cumulative in ICU | Cumulative in ICU | Cumulative in ICU/Ever in ICU | outcomes.hospitalized.in_icu |
| 11 | inIcuCurrently | float64 | 11534.0 | 3.517279e+02 | 5.855802e+02 | 0.0 | 59.00 | 168.0 | 372.00 | 5225.0 | Currently in ICU | Currently in ICU | Currently in ICU/Now in ICU | hospitalization.in_icu.currently |
| 12 | negative | float64 | 13274.0 | 8.492344e+05 | 1.345004e+06 | 0.0 | 54647.25 | 305972.0 | 1057705.25 | 10186941.0 | Negative PCR tests (people) | Negative (People or Cases) | NaN | NaN |
| 13 | negativeIncrease | int64 | 20780.0 | 3.588982e+03 | 1.157859e+04 | -968686.0 | 0.00 | 129.0 | 3931.00 | 212974.0 | NaN | NaN | NaN | NaN |
| 14 | negativeTestsAntibody | float64 | 1541.0 | 1.383834e+05 | 1.872974e+05 | 587.0 | 11908.00 | 73707.0 | 155187.00 | 864153.0 | Negative antibody tests (specimens) | Negative Antibody Tests | NaN | NaN |
| 15 | negativeTestsPeopleAntibody | float64 | 972.0 | 1.887110e+05 | 2.002181e+05 | 1.0 | 54874.00 | 100282.0 | 261121.00 | 816231.0 | Negative antibody tests (people) | Negative Antibody Tests (People) | NaN | NaN |
| 16 | negativeTestsViral | float64 | 5057.0 | 1.806762e+06 | 2.491662e+06 | 1.0 | 292985.00 | 922827.0 | 2306389.00 | 16887410.0 | Negative PCR tests (people) | Negative Tests (PCR) | NaN | NaN |
| 17 | onVentilatorCumulative | float64 | 1293.0 | 5.734370e+02 | 4.348208e+02 | 32.0 | 219.00 | 412.0 | 818.00 | 1533.0 | Cumulative on Ventilator | Cumulative on Ventilator | Cumulative on ventilator/Ever on ventilator | hospitalization.on_ventilator.cumulative |
| 18 | onVentilatorCurrently | float64 | 9122.0 | 1.516069e+02 | 2.077955e+02 | 0.0 | 29.00 | 86.0 | 185.00 | 2425.0 | Currently on Ventilator | Currently on Ventilator | Currently on ventilator/Now on ventilator | hospitalization.on_ventilator.currently |
| 19 | positive | float64 | 20584.0 | 1.652204e+05 | 3.268324e+05 | 0.0 | 5768.25 | 46204.0 | 178030.75 | 3501394.0 | Cases (confirmed plus probable) | Positive Cases (People, confirmed + probable) | NaN | NaN |
| 20 | positiveCasesViral | float64 | 14014.0 | 1.815002e+05 | 3.371105e+05 | 0.0 | 11559.50 | 71328.0 | 206624.50 | 3501394.0 | Confirmed cases | Positive Cases (PCR) | Confirmed Cases | cases.confirmed |
| 21 | positiveIncrease | int64 | 20780.0 | 1.383803e+03 | 3.023323e+03 | -7757.0 | 65.00 | 435.0 | 1337.00 | 71734.0 | NaN | NaN | NaN | NaN |
| 22 | positiveScore | int64 | 20780.0 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 | NaN | NaN | NaN | NaN |
| 23 | positiveTestsAntibody | float64 | 3430.0 | 1.936076e+04 | 2.853254e+04 | 0.0 | 974.00 | 8185.5 | 25012.50 | 190026.0 | Positive antibody tests (specimens) | Positive Antibody Test | NaN | NaN |
| 24 | positiveTestsAntigen | float64 | 2233.0 | 3.183724e+04 | 4.192966e+04 | 0.0 | 1085.00 | 13661.0 | 49010.00 | 211546.0 | Positive antigen tests (specimens) | Positive Antigen Tests | NaN | NaN |
| 25 | positiveTestsPeopleAntibody | float64 | 1094.0 | 2.051697e+04 | 2.926756e+04 | 0.0 | 3155.50 | 11956.0 | 19059.00 | 178979.0 | Positive antibody tests (people) | Positive Antibody Tests (People) | NaN | NaN |
| 26 | positiveTestsPeopleAntigen | float64 | 633.0 | 2.525905e+04 | 2.413925e+04 | 3.0 | 2682.00 | 17763.0 | 47012.00 | 81803.0 | Positive antigen tests (people) | Positive Antigen Tests (People) | NaN | NaN |
| 27 | positiveTestsViral | float64 | 8991.0 | 1.977728e+05 | 3.510590e+05 | 0.0 | 15754.50 | 64890.0 | 223736.50 | 2628176.0 | Positive PCR tests (specimens) | Positive Tests (PCR) | NaN | NaN |
| 28 | recovered | float64 | 12002.0 | 9.424977e+04 | 2.110444e+05 | 2.0 | 3379.00 | 17619.0 | 93154.75 | 2502609.0 | Recovered | Recovered | NaN | NaN |
| 29 | totalTestEncountersViral | float64 | 5231.0 | 2.702098e+06 | 5.070892e+06 | 0.0 | 193793.50 | 905322.0 | 2780541.50 | 39695100.0 | NaN | NaN | NaN | NaN |
| 30 | totalTestEncountersViralIncrease | int64 | 20780.0 | 5.576445e+03 | 2.125331e+04 | -16946.0 | 0.00 | 0.0 | 0.00 | 324671.0 | NaN | NaN | NaN | NaN |
| 31 | totalTestResults | float64 | 20614.0 | 2.186918e+06 | 4.436516e+06 | 0.0 | 104012.00 | 655267.0 | 2264766.50 | 49646014.0 | NaN | NaN | Total test results | tests.pcr.total |
| 32 | totalTestResultsIncrease | int64 | 20780.0 | 1.750667e+04 | 3.359577e+04 | -130545.0 | 1205.75 | 6125.5 | 19075.00 | 473076.0 | NaN | NaN | NaN | NaN |
| 33 | totalTestsAntibody | float64 | 4504.0 | 1.500727e+05 | 1.979854e+05 | 0.0 | 16141.50 | 75333.0 | 188968.50 | 1054711.0 | Total antibody tests (specimens) | Total Antibody Tests | NaN | NaN |
| 34 | totalTestsAntigen | float64 | 3421.0 | 3.089196e+05 | 4.232854e+05 | 1.0 | 20047.00 | 123384.0 | 432727.00 | 2664340.0 | Total antigen tests (specimens) | Total Antigen Tests | NaN | NaN |
| 35 | totalTestsPeopleAntibody | float64 | 2200.0 | 1.654217e+05 | 1.781903e+05 | 1.0 | 54912.75 | 103967.5 | 183103.00 | 995580.0 | Total antibody tests (people) | Total Antibody Tests (People) | NaN | NaN |
| 36 | totalTestsPeopleAntigen | float64 | 999.0 | 1.681882e+05 | 1.437481e+05 | 3.0 | 37675.50 | 144130.0 | 255251.00 | 580372.0 | Total antigen tests (people) | Total Antigen Tests (People) | NaN | NaN |
| 37 | totalTestsPeopleViral | float64 | 9228.0 | 9.649620e+05 | 1.412589e+06 | 0.0 | 143813.25 | 422659.5 | 1224232.00 | 11248247.0 | Total PCR tests (people) | Total PCR Tests (People) | NaN | NaN |
| 38 | totalTestsPeopleViralIncrease | int64 | 20780.0 | 2.739375e+03 | 2.161433e+04 | -1122327.0 | 0.00 | 0.0 | 2543.00 | 898259.0 | NaN | NaN | NaN | NaN |
| 39 | totalTestsViral | float64 | 14469.0 | 2.311843e+06 | 4.450738e+06 | 0.0 | 135986.00 | 739140.0 | 2511567.00 | 49646014.0 | Total PCR tests (specimens) | Total Tests (PCR) | NaN | NaN |
| 40 | totalTestsViralIncrease | int64 | 20780.0 | 1.296143e+04 | 3.755355e+04 | -1154583.0 | 0.00 | 1873.0 | 12443.00 | 2164543.0 | NaN | NaN | NaN | NaN |
Select the rows (states) to include and drop inconsistent data.
#### 1 - Eliminate Rows with Missing or Inconsistent Values
# load up modules
import pandas as pd
import numpy as np
# load for visuals
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
# set up notebook to display multiple output in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# read in file from module 5
if "df" not in globals():
df = pd.read_csv('https://covidtracking.com/data/download/all-states-history.csv')
#Filtering the data to focus on the key states -
#Identify the Sample States that will be analyzed for our project based on severity of COVID
#Washington state is included as it reported the first COVID case and the first COVID fatality in the US
sample_state_list = ['NY','OH','MI','PA','CA','TX','FL','WA']
df_sample = df[df['state'].isin(sample_state_list)]
rows_initial = df_sample.shape[0]
print('Before dropping inconsistent values, the dataframe had {} rows'.format(rows_initial)) #Shape of the data
# Data in first quarter of 2020 was inconsistent, limit to 01-APR-2020 and newer
df_sample = df_sample[df_sample['date'] >= '2020-04-01']
rows_after_date_filter = df_sample.shape[0]
print(f'After dropping rows before 01-Apr-2020, the dataframe had {rows_after_date_filter} rows')
#Drop Rows with Missing or Inconsistent Values
# Note: dropna() returns a new frame; calling it without assignment is a no-op,
# and assigning it would drop nearly every row since many columns are sparsely
# populated. We therefore rely on the death-null filter below instead.
#df_sample.dropna()
#Drop Rows where our Primary Metric Death(s) is Null
deaths_null_indexes = df_sample[df_sample.death.isnull()].index
df_sample = df_sample.drop(deaths_null_indexes)
rows_after_dropna = df_sample.shape[0]
print(f'After dropping inconsistent values, the dataframe had {rows_after_dropna} rows') #Shape of the data
print('Number of Inconsistent Rows Removed:', rows_initial - rows_after_dropna)
Before dropping inconsistent values, the dataframe had 3044 rows
After dropping rows before 01-Apr-2020, the dataframe had 2728 rows
After dropping inconsistent values, the dataframe had 2728 rows
Number of Inconsistent Rows Removed: 316
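A pandas detail worth noting for the cleaning steps in this section: `dropna()` and `drop_duplicates()` return new frames rather than modifying the frame in place, so the result must be reassigned (or `inplace=True` passed) to take effect. A quick demonstration on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

toy.dropna()            # returns a new frame; toy itself is unchanged
print(len(toy))         # → 3

toy = toy.dropna()      # reassign (or pass inplace=True) to make it stick
print(len(toy))         # → 2
```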
#### 2 - Drop Duplicate Rows from the Sample
df_sample = df_sample.drop_duplicates()  # assign the result; drop_duplicates() returns a new frame
rows_after_dropduplicates = df_sample.shape[0]
print('After dropping duplicate values, the dataframe had {} rows'.format(rows_after_dropduplicates)) #Shape of the data
print('Number of Duplicate Rows Removed:',rows_after_dropna - rows_after_dropduplicates)
| date | state | death | deathConfirmed | deathIncrease | deathProbable | hospitalized | hospitalizedCumulative | hospitalizedCurrently | hospitalizedIncrease | ... | totalTestResults | totalTestResultsIncrease | totalTestsAntibody | totalTestsAntigen | totalTestsPeopleAntibody | totalTestsPeopleAntigen | totalTestsPeopleViral | totalTestsPeopleViralIncrease | totalTestsViral | totalTestsViralIncrease | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2021-03-07 | CA | 54124.0 | NaN | 258 | NaN | NaN | NaN | 4291.0 | 0 | ... | 49646014.0 | 133186 | NaN | NaN | NaN | NaN | NaN | 0 | 49646014.0 | 133186 |
| 10 | 2021-03-07 | FL | 32266.0 | NaN | 66 | NaN | 82237.0 | 82237.0 | 3307.0 | 92 | ... | 22339182.0 | 64599 | 1054711.0 | NaN | 995580.0 | NaN | 11248247.0 | 23190 | 19482607.0 | 52132 |
| 24 | 2021-03-07 | MI | 16658.0 | 15666.0 | 0 | 992.0 | NaN | NaN | 866.0 | 0 | ... | 10621967.0 | 0 | 537073.0 | NaN | NaN | NaN | NaN | 0 | 10621967.0 | 0 |
| 37 | 2021-03-07 | NY | 39029.0 | NaN | 59 | NaN | 89995.0 | 89995.0 | 4789.0 | 0 | ... | 39695100.0 | 227768 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 38 | 2021-03-07 | OH | 17656.0 | 14752.0 | 0 | 2594.0 | 50881.0 | 50881.0 | 820.0 | 33 | ... | 10257157.0 | 29289 | NaN | 1798150.0 | NaN | NaN | NaN | 0 | 10257157.0 | 29289 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19077 | 2020-04-01 | NY | 1941.0 | NaN | 391 | NaN | 19929.0 | 19929.0 | 12226.0 | 2844 | ... | 220934.0 | 15701 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 19078 | 2020-04-01 | OH | 65.0 | NaN | 10 | NaN | 679.0 | 679.0 | NaN | 94 | ... | 30153.0 | 2543 | NaN | NaN | NaN | NaN | NaN | 0 | 30153.0 | 2543 |
| 19081 | 2020-04-01 | PA | 74.0 | NaN | 11 | NaN | 620.0 | 620.0 | NaN | 106 | ... | 49527.0 | 6034 | NaN | NaN | NaN | NaN | 48232.0 | 5744 | NaN | 0 |
| 19087 | 2020-04-01 | TX | 58.0 | NaN | 17 | NaN | NaN | NaN | 196.0 | 0 | ... | 86329.0 | 8109 | NaN | NaN | NaN | NaN | NaN | 0 | 86329.0 | 8109 |
| 19092 | 2020-04-01 | WA | 282.0 | NaN | 20 | NaN | NaN | NaN | NaN | 0 | ... | 93374.0 | 4962 | NaN | NaN | NaN | NaN | 90110.0 | 4717 | NaN | 0 |
2728 rows × 41 columns
After dropping duplicate values, the dataframe had 2728 rows
Number of Duplicate Rows Removed: 0
Not all columns are needed. We are interested in hospitalizations, ICU usage, Ventilator usage, deaths, and testing.
# Start
print('Number of Rows in the Dataset',len(df_sample))
# REMOVED MGS 3/3
# To drop any columns that don't have data identified in 70% of the rows
# thresh = len(df_sample) * .7
# print('Minimum threshold to analyze the column',thresh)
# df_sample_filter = df_sample.dropna(thresh = thresh, axis = 1, inplace = False)
# Only keep variables needed
cols_initial = df_sample.shape[1]
df_sample_filter = df_sample[['date','state',
'hospitalizedCurrently','hospitalizedIncrease',
'positiveIncrease','totalTestResultsIncrease',
'inIcuCurrently','onVentilatorCurrently',
'deathIncrease','death']]
# Convert data types to simpler ones
df_sample_filter = df_sample_filter.convert_dtypes()
df_sample_filter['date'] = pd.to_datetime(df_sample_filter['date'], errors='coerce')
# print stats, dump to Excel
cols_after_inconsistentcolumns = df_sample_filter.shape[1]
print('After dropping columns, the dataframe had {} columns'.format(cols_after_inconsistentcolumns)) #Shape of the data
print('Number of Columns Removed:',cols_initial - cols_after_inconsistentcolumns)
print('Shape: ', df_sample_filter.shape) #Shape After Removing
df_sample_filter.info() #Information on the Dataframe After Dropping Columns
df_sample_filter.to_excel('CovidDataCleaned.xlsx', index=False)
Number of Rows in the Dataset 2728
After dropping columns, the dataframe had 10 columns
Number of Columns Removed: 31
Shape:  (2728, 10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2728 entries, 5 to 19092
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date                      2728 non-null   datetime64[ns]
 1   state                     2728 non-null   string
 2   hospitalizedCurrently     2580 non-null   Int64
 3   hospitalizedIncrease      2728 non-null   Int64
 4   positiveIncrease          2728 non-null   Int64
 5   totalTestResultsIncrease  2728 non-null   Int64
 6   inIcuCurrently            1896 non-null   Int64
 7   onVentilatorCurrently     1575 non-null   Int64
 8   deathIncrease             2728 non-null   Int64
 9   death                     2728 non-null   Int64
dtypes: Int64(8), datetime64[ns](1), string(1)
memory usage: 320.3 KB
### Plotting the Data
import matplotlib.pyplot as plt

df_sample_filter.info()
df_sample_filter.sort_index(ascending=False).plot(x='date', y='deathIncrease', figsize=(12,6),
                                                  title='Daily Death Increase (all states)')
df_sample_filter.pivot(index="date",
                       columns="state",
                       values="death").plot(figsize=(12,6),
                                            title='Cumulative Deaths by State')
plt.show();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2728 entries, 5 to 19092
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date                      2728 non-null   datetime64[ns]
 1   state                     2728 non-null   string
 2   hospitalizedCurrently     2580 non-null   Int64
 3   hospitalizedIncrease      2728 non-null   Int64
 4   positiveIncrease          2728 non-null   Int64
 5   totalTestResultsIncrease  2728 non-null   Int64
 6   inIcuCurrently            1896 non-null   Int64
 7   onVentilatorCurrently     1575 non-null   Int64
 8   deathIncrease             2728 non-null   Int64
 9   death                     2728 non-null   Int64
dtypes: Int64(8), datetime64[ns](1), string(1)
memory usage: 320.3 KB
import seaborn as sns
import matplotlib.pyplot as plt

# Create 2-column box plots for variables of interest
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(4, 2, figsize=(16,12))
columns = ['hospitalizedCurrently','hospitalizedIncrease',
'positiveIncrease','totalTestResultsIncrease',
'inIcuCurrently','onVentilatorCurrently','deathIncrease', 'death']
for col, ax in zip(columns, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8)):
sns.boxplot(x="state", y=col, ax=ax,
data=df_sample_filter
);
plt.show();
# Look at data distributions
df_sample_filter.describe().round()
df_sample_filter.groupby('state').agg('count')
| | hospitalizedCurrently | hospitalizedIncrease | positiveIncrease | totalTestResultsIncrease | inIcuCurrently | onVentilatorCurrently | deathIncrease | death |
|---|---|---|---|---|---|---|---|---|
| count | 2580.0 | 2728.0 | 2728.0 | 2728.0 | 1896.0 | 1575.0 | 2728.0 | 2728.0 |
| mean | 3686.0 | 82.0 | 4612.0 | 61575.0 | 1050.0 | 284.0 | 85.0 | 12191.0 |
| std | 3993.0 | 255.0 | 6511.0 | 66480.0 | 1092.0 | 309.0 | 127.0 | 10927.0 |
| min | 196.0 | -1145.0 | 0.0 | -84769.0 | 52.0 | 8.0 | -201.0 | 58.0 |
| 25% | 738.0 | 0.0 | 786.0 | 17545.0 | 234.0 | 87.0 | 14.0 | 3306.0 |
| 50% | 2278.0 | 0.0 | 2049.0 | 40172.0 | 721.0 | 144.0 | 43.0 | 7870.0 |
| 75% | 5121.0 | 78.0 | 6468.0 | 77850.0 | 1409.0 | 411.0 | 108.0 | 18769.0 |
| max | 22851.0 | 3424.0 | 71734.0 | 473076.0 | 5225.0 | 2425.0 | 2559.0 | 54124.0 |
| state | date | hospitalizedCurrently | hospitalizedIncrease | positiveIncrease | totalTestResultsIncrease | inIcuCurrently | onVentilatorCurrently | deathIncrease | death |
|---|---|---|---|---|---|---|---|---|---|
| CA | 341 | 341 | 341 | 341 | 341 | 341 | 0 | 341 | 341 |
| FL | 341 | 241 | 341 | 341 | 341 | 0 | 0 | 341 | 341 |
| MI | 341 | 333 | 341 | 341 | 341 | 333 | 333 | 341 | 341 |
| NY | 341 | 341 | 341 | 341 | 341 | 341 | 305 | 341 | 341 |
| OH | 341 | 310 | 341 | 341 | 341 | 310 | 310 | 341 | 341 |
| PA | 341 | 335 | 341 | 341 | 341 | 125 | 336 | 341 | 341 |
| TX | 341 | 341 | 341 | 341 | 341 | 229 | 0 | 341 | 341 |
| WA | 341 | 338 | 341 | 341 | 341 | 217 | 291 | 341 | 341 |
Based on the counts by state, data completeness varies widely: CA, FL, and TX report no ventilator data (FL also reports no ICU data), while MI, NY, and OH have near-complete counts for every variable.
NOTE: The remaining analysis will focus on the states with the most complete data: MI, NY, and OH.
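The completeness criterion behind that choice can be computed directly; here is a minimal sketch on hypothetical data (states and values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: MI reports ventilator counts, TX does not
df = pd.DataFrame({
    "state": ["MI", "MI", "TX", "TX"],
    "onVentilatorCurrently": [97.0, 95.0, np.nan, np.nan],
})

# Fraction of non-null values per state, i.e. the completeness of each column
completeness = (df.groupby("state")["onVentilatorCurrently"]
                  .agg(lambda s: s.notna().mean()))
print(completeness.to_dict())  # {'MI': 1.0, 'TX': 0.0}
```

Applied to the real frame, a state scoring near 1.0 on all variables of interest is a good candidate for the lag analysis.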
A derived ratio of positive tests to total tests (positiveRatio) will be needed to test hypothesis 1.
# Create derived column, add to frame
df_sample_filter.info()
df_sample_filter["positiveRatio"] = (df_sample_filter.positiveIncrease / df_sample_filter.totalTestResultsIncrease)
df_sample_filter.info()
df_sample_filter
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2728 entries, 5 to 19092
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date                      2728 non-null   datetime64[ns]
 1   state                     2728 non-null   string
 2   hospitalizedCurrently     2580 non-null   Int64
 3   hospitalizedIncrease      2728 non-null   Int64
 4   positiveIncrease          2728 non-null   Int64
 5   totalTestResultsIncrease  2728 non-null   Int64
 6   inIcuCurrently            1896 non-null   Int64
 7   onVentilatorCurrently     1575 non-null   Int64
 8   deathIncrease             2728 non-null   Int64
 9   death                     2728 non-null   Int64
dtypes: Int64(8), datetime64[ns](1), string(1)
memory usage: 320.3 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2728 entries, 5 to 19092
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date                      2728 non-null   datetime64[ns]
 1   state                     2728 non-null   string
 2   hospitalizedCurrently     2580 non-null   Int64
 3   hospitalizedIncrease      2728 non-null   Int64
 4   positiveIncrease          2728 non-null   Int64
 5   totalTestResultsIncrease  2728 non-null   Int64
 6   inIcuCurrently            1896 non-null   Int64
 7   onVentilatorCurrently     1575 non-null   Int64
 8   deathIncrease             2728 non-null   Int64
 9   death                     2728 non-null   Int64
 10  positiveRatio             2728 non-null   Float64
dtypes: Float64(1), Int64(8), datetime64[ns](1), string(1)
memory usage: 344.3 KB
| | date | state | hospitalizedCurrently | hospitalizedIncrease | positiveIncrease | totalTestResultsIncrease | inIcuCurrently | onVentilatorCurrently | deathIncrease | death | positiveRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2021-03-07 | CA | 4291 | 0 | 3816 | 133186 | 1159 | <NA> | 258 | 54124 | 0.028652 |
| 10 | 2021-03-07 | FL | 3307 | 92 | 4024 | 64599 | <NA> | <NA> | 66 | 32266 | 0.062292 |
| 24 | 2021-03-07 | MI | 866 | 0 | 0 | 0 | 222 | 97 | 0 | 16658 | NaN |
| 37 | 2021-03-07 | NY | 4789 | 0 | 6789 | 227768 | 999 | 682 | 59 | 39029 | 0.029807 |
| 38 | 2021-03-07 | OH | 820 | 33 | 735 | 29289 | 260 | 171 | 0 | 17656 | 0.025095 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19077 | 2020-04-01 | NY | 12226 | 2844 | 7917 | 15701 | 3022 | <NA> | 391 | 1941 | 0.504235 |
| 19078 | 2020-04-01 | OH | <NA> | 94 | 348 | 2543 | <NA> | <NA> | 10 | 65 | 0.136846 |
| 19081 | 2020-04-01 | PA | <NA> | 106 | 962 | 6034 | <NA> | <NA> | 11 | 74 | 0.15943 |
| 19087 | 2020-04-01 | TX | 196 | 0 | 730 | 8109 | <NA> | <NA> | 17 | 58 | 0.090023 |
| 19092 | 2020-04-01 | WA | <NA> | 0 | 568 | 4962 | <NA> | <NA> | 20 | 282 | 0.11447 |
2728 rows × 11 columns
Sort the data ascending by date and smooth it with a 7-day average.
import seaborn as sns
df_ny = df_sample_filter[df_sample_filter.state=="NY"]
df_ny = df_ny.set_index("date")
df_ny.dropna(inplace=True)
df_ny.plot.line(subplots=True,layout=[5,2], figsize=[18,9]);
Smooth the data with a 7-day average.
# onVentilatorCurrently was null in May 2020; drop any remaining NaN rows
# (already handled above, so the shape is unchanged)
print('Shape of df_ny before: ', df_ny.shape)
df_ny = df_ny.dropna()
print('Shape of df_ny after: ', df_ny.shape)
# rolling 7 day window
df_ny_rolling = df_ny.rolling(7).mean()
df_ny_rolling.plot.line(subplots=True,layout=[5,2], figsize=[18,9]);
Shape of df_ny before:  (305, 10)
Shape of df_ny after:  (305, 10)
The charts show that the day-to-day "choppiness" has been smoothed away by the 7-day rolling average.
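The smoothing effect is easy to see on a toy series (a minimal sketch with invented numbers; a 2-day window is used for brevity where the analysis uses 7 days):

```python
import pandas as pd

# Toy "choppy" daily series that alternates sharply, mimicking uneven daily reporting
raw = pd.Series([0, 10, 0, 10, 0, 10, 0, 10, 0, 10])
smooth = raw.rolling(2).mean()

# The first window-1 values are NaN (incomplete window);
# the rest collapse to the window average
print(smooth.tolist())  # [nan, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
```

The same trailing-window mechanics explain the leading NaN rows seen in the rolling dataframes below.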
Note that the "date" and "state" columns become the index. This makes charting easier and ensures only one date/state combination (uniqueness).
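The uniqueness guarantee can be made explicit with pandas' `verify_integrity` flag; a minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data: one row per (date, state) combination
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-02"]),
    "state": ["MI", "OH", "MI"],
    "death": [10, 5, 12],
})

# verify_integrity=True makes pandas reject duplicate index entries outright
idx = df.set_index(["date", "state"], verify_integrity=True)
print(idx.index.is_unique)  # True

# A duplicated (date, state) pair raises a ValueError
dup = pd.concat([df, df.iloc[[0]]])
try:
    dup.set_index(["date", "state"], verify_integrity=True)
except ValueError as e:
    print("rejected:", e)
```

Without the flag, `set_index` silently accepts duplicates, so the check is a cheap safeguard when uniqueness matters for joins.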
# MI
df_mi = df_sample_filter[df_sample_filter.state=="MI"]
df_mi = df_mi.set_index(["date","state"])
df_mi.dropna(inplace=True)
df_mi_rolling = df_mi.rolling(7).mean()
# OH
df_oh = df_sample_filter[df_sample_filter.state=="OH"]
df_oh = df_oh.set_index(["date","state"])
df_oh.dropna(inplace=True)
df_oh_rolling = df_oh.rolling(7).mean()
# NY
df_ny = df_sample_filter[df_sample_filter.state=="NY"]
df_ny = df_ny.set_index(["date","state"])
df_ny.dropna(inplace=True)
df_ny_rolling = df_ny.rolling(7).mean()
# Create single set with rolling mean by state
df_rolling = pd.concat([df_ny_rolling, df_mi_rolling, df_oh_rolling])
df_rolling
| date | state | hospitalizedCurrently | hospitalizedIncrease | positiveIncrease | totalTestResultsIncrease | inIcuCurrently | onVentilatorCurrently | deathIncrease | death | positiveRatio |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021-03-07 | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-03-06 | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-03-05 | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-03-04 | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-03-03 | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-05-06 | OH | 1013.714286 | 83.285714 | 611.571429 | 8423.142857 | 394.142857 | 261.428571 | 43.000000 | 1323.857143 | 0.073272 |
| 2020-05-05 | OH | 1019.000000 | 86.285714 | 614.714286 | 8269.142857 | 394.285714 | 263.000000 | 43.000000 | 1280.857143 | 0.075710 |
| 2020-05-04 | OH | 1028.571429 | 83.142857 | 595.285714 | 8011.571429 | 397.285714 | 265.714286 | 43.285714 | 1237.857143 | 0.075865 |
| 2020-05-03 | OH | 1024.142857 | 84.000000 | 623.142857 | 7628.428571 | 399.000000 | 265.857143 | 44.285714 | 1194.571429 | 0.082590 |
| 2020-05-02 | OH | 1034.714286 | 83.428571 | 610.428571 | 7270.142857 | 407.000000 | 272.285714 | 43.428571 | 1150.285714 | 0.084228 |
948 rows × 9 columns
#Heatmap
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Compute the correlation matrix
corr = df_ny_rolling.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap (defined but unused; the heatmap below uses the built-in 'RdYlBu')
cmap = sns.diverging_palette(20, 300, as_cmap=True, n=10)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap="RdYlBu", #center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
corr.round(3)
<AxesSubplot:>
| | hospitalizedCurrently | hospitalizedIncrease | positiveIncrease | totalTestResultsIncrease | inIcuCurrently | onVentilatorCurrently | deathIncrease | death | positiveRatio |
|---|---|---|---|---|---|---|---|---|---|
| hospitalizedCurrently | 1.000 | 0.224 | 0.884 | 0.701 | 0.932 | 0.856 | 0.978 | 0.691 | 0.929 |
| hospitalizedIncrease | 0.224 | 1.000 | -0.156 | -0.375 | 0.547 | 0.668 | 0.365 | -0.314 | 0.289 |
| positiveIncrease | 0.884 | -0.156 | 1.000 | 0.877 | 0.679 | 0.538 | 0.818 | 0.706 | 0.887 |
| totalTestResultsIncrease | 0.701 | -0.375 | 0.877 | 1.000 | 0.464 | 0.316 | 0.591 | 0.839 | 0.644 |
| inIcuCurrently | 0.932 | 0.547 | 0.679 | 0.464 | 1.000 | 0.983 | 0.961 | 0.499 | 0.872 |
| onVentilatorCurrently | 0.856 | 0.668 | 0.538 | 0.316 | 0.983 | 1.000 | 0.907 | 0.407 | 0.786 |
| deathIncrease | 0.978 | 0.365 | 0.818 | 0.591 | 0.961 | 0.907 | 1.000 | 0.580 | 0.933 |
| death | 0.691 | -0.314 | 0.706 | 0.839 | 0.499 | 0.407 | 0.580 | 1.000 | 0.487 |
| positiveRatio | 0.929 | 0.289 | 0.887 | 0.644 | 0.872 | 0.786 | 0.933 | 0.487 | 1.000 |
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="darkgrid")
f, ax = plt.subplots(figsize=(16, 6))
columns = ['hospitalizedCurrently','hospitalizedIncrease',
           'positiveIncrease', 'positiveRatio',
           'inIcuCurrently','onVentilatorCurrently','deathIncrease']
# draw each NY rolling series on the shared axis
for col in columns:
    sns.lineplot(x="date", y=col,
                 data=df_ny_rolling
                );
# long-form overlay of the same columns, with a legend
sns.lineplot(y="value", hue="variable", data=pd.melt(df_ny_rolling[columns]));
The chart above overlays the smoothed NY series. Hospitalizations, ICU usage, ventilator usage, and daily deaths move largely in tandem, consistent with the strong correlations in the heatmap above.
# seaborn pair plots, using the categorical "state" variable as the hue
#df_rolling.set_index("date",inplace=True)
columns = ['state','deathIncrease','hospitalizedCurrently','inIcuCurrently','onVentilatorCurrently','positiveRatio']
df_rolling.reset_index(inplace=True)
sns.pairplot(df_rolling[columns], hue="state")
<seaborn.axisgrid.PairGrid at 0x1f752c02dc0>
Note in the pair plots above there is strong correlation between hospitalizations, ICU usage, positivity rate, and deaths within each state.
Methodology: extract the state/date/positiveRatio/hospitalizedCurrently figures. Add columns for hospitalizedCurrently at T+1 through T+14. Measure the correlation at each lag and determine the time lag with maximum correlation.
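The lag-measuring idea can be checked on synthetic data first (a minimal sketch with invented series; an integer `shift` stands in for the date-offset join used in the cells below). A follower series built to trail a signal by exactly 2 days should show its maximum correlation at a 2-day lag:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-04-01", periods=60, freq="D")

# a noisy wave standing in for positivity, and a "hospitalizations"
# series that trails it by exactly 2 days
signal = pd.Series(np.sin(np.linspace(0, 6, 60)) + rng.normal(0, 0.05, 60), index=dates)
follower = signal.shift(2)

# correlate the signal against the follower at each candidate lag;
# Series.corr ignores the NaN rows introduced by shifting
corrs = {lag: signal.corr(follower.shift(-lag)) for lag in range(0, 8)}
best_lag = max(corrs, key=corrs.get)
print(best_lag)  # 2
```

The same scan over lags is what the T+1 through T+14 columns implement against the real data.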
# high level view of data with regression lines, interactive
import pandas as pd
import numpy as np
# load for visuals
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
df_rolling.info()
#df_rolling.head()
fig = px.scatter(df_rolling.dropna(), x="positiveIncrease", y="hospitalizedCurrently",
trendline="ols", width=1000, color="state")
fig.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date                      948 non-null    datetime64[ns]
 1   state                     948 non-null    object
 2   hospitalizedCurrently     930 non-null    float64
 3   hospitalizedIncrease      930 non-null    float64
 4   positiveIncrease          930 non-null    float64
 5   totalTestResultsIncrease  930 non-null    float64
 6   inIcuCurrently            930 non-null    float64
 7   onVentilatorCurrently     930 non-null    float64
 8   deathIncrease             930 non-null    float64
 9   death                     930 non-null    float64
 10  positiveRatio             719 non-null    float64
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 81.6+ KB
from datetime import datetime, timedelta
df_rolling.info()
df_rolling.head()
# extract the columns needed.
columns = ['date','state','positiveRatio','hospitalizedCurrently']
df_h1 = df_rolling[columns]
df_h1 = df_h1.dropna()
min_date = df_h1.date.min()
max_date = df_h1.date.max()
days = pd.Timedelta((max_date - min_date), unit='d').days
print("min_date:",min_date, "max_date:",max_date, "Days:", days)
# Create master index to transpose/shift data
date_index = pd.date_range(start=min_date, periods=days, freq='D')
# Create a dataframe with every combo of date and state
df_test1 = pd.DataFrame(columns=['date','state'])
for state in ('MI','OH','NY'):
df_tmp = pd.DataFrame(date_index.values, columns=['date'])
df_tmp["state"] = state
    df_test1 = pd.concat([df_test1, df_tmp], sort=True)  # DataFrame.append is deprecated
# Join to subset of columns in rolling df, note the join on multilevel index
df_test1 = df_test1.set_index(["date","state"])
# df_h1.set_index(["date","state"], inplace=True) <<this generates "A value is trying to be set on a copy of a slice from a DataFrame"
df_h1 = df_h1.set_index(["date","state"])
df_test1 = df_test1.join(df_h1,how='outer')
df_test1
print('*'*80)
# Add the t1 thru t14 columns by copying current data, then offsetting the date and join back
# for a given base date, the t## column gives the value t days in future
for i in range(-1,-15,-1):
df_tmp = df_h1[["hospitalizedCurrently"]]
df_tmp = df_tmp.reset_index()
df_tmp[["date"]] = df_tmp[["date"]] + timedelta(days=i)
df_tmp.rename(columns={"hospitalizedCurrently":"t"+str(-i)}, inplace=True)
df_tmp = df_tmp.set_index(['date','state'])
df_test1 = df_test1.join(df_tmp,how='outer')
df_test1.info()
df_test1.head()
print('shape of df_test1 before dropna:', df_test1.shape)
df_test1.dropna(inplace=True)
print('shape of df_test1 after dropna:', df_test1.shape)
# join df_tmp to master
print("\n",'*'*80)
print(df_test1.corr()[['positiveRatio','hospitalizedCurrently']].round(4),'\n')
df_corr = df_test1.corr().reset_index()
df_corr['time']= df_corr.index.values
# plot the relationship of correlation over time in days
fig = plt.figure(figsize=(8, 3))
df_corr[1:].plot.line(x="time", y="positiveRatio",
                      title='Correlation of positivity rate vs. hospitalization by days',
                      xlabel='days', ylabel='correlation coefficient');
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 948 non-null datetime64[ns]
1 state 948 non-null object
2 hospitalizedCurrently 930 non-null float64
3 hospitalizedIncrease 930 non-null float64
4 positiveIncrease 930 non-null float64
5 totalTestResultsIncrease 930 non-null float64
6 inIcuCurrently 930 non-null float64
7 onVentilatorCurrently 930 non-null float64
8 deathIncrease 930 non-null float64
9 death 930 non-null float64
10 positiveRatio 719 non-null float64
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 81.6+ KB
min_date: 2020-04-09 00:00:00 max_date: 2021-03-01 00:00:00 Days: 326
********************************************************************************
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 994 entries, (Timestamp('2020-03-26 00:00:00'), 'MI') to (Timestamp('2021-03-01 00:00:00'), 'OH')
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 positiveRatio 719 non-null float64
1 hospitalizedCurrently 719 non-null float64
2 t1 719 non-null float64
3 t2 719 non-null float64
4 t3 719 non-null float64
5 t4 719 non-null float64
6 t5 719 non-null float64
7 t6 719 non-null float64
8 t7 719 non-null float64
9 t8 719 non-null float64
10 t9 719 non-null float64
11 t10 719 non-null float64
12 t11 719 non-null float64
13 t12 719 non-null float64
14 t13 719 non-null float64
15 t14 719 non-null float64
dtypes: float64(16)
memory usage: 138.2+ KB
shape of df_test1 before dropna: (994, 16)
shape of df_test1 after dropna: (719, 16)
********************************************************************************
positiveRatio hospitalizedCurrently
positiveRatio 1.0000 0.4436
hospitalizedCurrently 0.4436 1.0000
t1 0.4439 0.9995
t2 0.4435 0.9982
t3 0.4423 0.9960
t4 0.4404 0.9929
t5 0.4377 0.9891
t6 0.4344 0.9844
t7 0.4305 0.9790
t8 0.4260 0.9729
t9 0.4207 0.9660
t10 0.4149 0.9585
t11 0.4085 0.9504
t12 0.4016 0.9417
t13 0.3943 0.9324
t14 0.3865 0.9226
The correlation between t1 and t2 is slightly higher than the same-day comparison of positive test ratio and hospitalizations. Overall the correlation is weak (~0.44).
From this we can say an increase in the positive test ratio indicates a slight increase in hospitalizations about 2 days later.
Methodology: extract the state/date/hospitalizedCurrently/onVentilatorCurrently/deathIncrease figures. Add columns for onVentilatorCurrently and deathIncrease at T+1 through T+14. Measure the correlation at each lag and determine the time lag with maximum correlation.
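As a minimal alternative sketch (hypothetical numbers): on a `DatetimeIndex`, `shift(-k, freq="D")` builds the same kind of T+k columns without the manual date-offset join, because shifting by a frequency re-dates each value rather than repositioning it:

```python
import pandas as pd

# Hypothetical daily series on a DatetimeIndex
dates = pd.date_range("2020-04-01", periods=5, freq="D")
hosp = pd.Series([100.0, 110.0, 125.0, 130.0, 128.0], index=dates,
                 name="hospitalizedCurrently")

# shift(-k, freq="D") re-dates each value k days earlier, so aligning it
# back on the original index yields the value observed k days later
lagged = pd.DataFrame({"t0": hosp})
for k in (1, 2):
    lagged[f"t{k}"] = hosp.shift(-k, freq="D")

print(lagged.loc["2020-04-01"].tolist())  # [100.0, 110.0, 125.0]
```

The trailing dates have no future value available, so the last k rows of each t-column come out NaN, just as in the joined frames above.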
fig = px.scatter(df_rolling.dropna(), color="state", x="hospitalizedCurrently", y="inIcuCurrently",
trendline="ols", width=900)
fig.show()
fig = px.scatter(df_rolling.dropna(), color="state", x="hospitalizedCurrently", y="deathIncrease",
trendline="ols", width=900)
fig.show()
fig = px.scatter(df_rolling.dropna(), color="state", x="hospitalizedCurrently", y="onVentilatorCurrently",
trendline="ols", width=900)
fig.show()
from datetime import datetime, timedelta
# extract the columns needed.
columns = ['date','state','hospitalizedCurrently','inIcuCurrently','onVentilatorCurrently','deathIncrease']
df_h2 = df_rolling[columns]
df_h2 = df_h2.dropna()
min_date = df_h2.date.min()
max_date = df_h2.date.max()
days = pd.Timedelta((max_date - min_date), unit='d').days
print("min_date:",min_date, "max_date:",max_date, "Days:", days)
# Create master index of dates to transpose/shift data
date_index = pd.date_range(start=min_date, periods=days, freq='D')
# Create a dataframe with every combo of date and state
df_test2 = pd.DataFrame(columns=['date','state'])
for state in ('MI','OH','NY'):
df_tmp = pd.DataFrame(date_index.values, columns=['date'])
df_tmp["state"] = state
    df_test2 = pd.concat([df_test2, df_tmp], sort=True)  # DataFrame.append is deprecated
# Join to subset of columns in rolling df, note the join on the multilevel index
df_test2 = df_test2.set_index(["date","state"])
df_h2 = df_h2.set_index(["date","state"])
df_test2 = df_test2.join(df_h2,how='outer')
df_test2.info()
df_h2.info()
print('\n','*'*80,'\n')
# Add the t1 thru t14 columns by copying current data, then offsetting the date and join back
# for a given base date, the t## column gives the value t days in future
column_names = ['hospitalizedCurrently','inIcuCurrently','onVentilatorCurrently','deathIncrease']
column_suffixes = ('hosp','ICU','vent','death')
for col in zip(column_names, column_suffixes):
for i in range(-1,-15,-1):
df_tmp = df_h2[[col[0]]]
df_tmp = df_tmp.reset_index()
df_tmp.date = df_tmp.date + timedelta(days=i)
df_tmp.rename(columns={col[0]:"t"+str(-i)+'_'+col[1]}, inplace=True)
df_tmp = df_tmp.set_index(['date','state'])
df_test2 = df_test2.join(df_tmp,how='outer')
# join df_tmp to master
print('\n', '*'*40, 'CORRELATION SECTION', '*'*40)
# compare pairs of variables,
# first is column name, second is suffix, third is plot axis (subplot)
column_pairs = (['hospitalizedCurrently','ICU', 0,0],
['hospitalizedCurrently','vent', 0,1],
['hospitalizedCurrently','death', 0,2],
['inIcuCurrently','vent', 1,0],
['inIcuCurrently','death', 1,1],
['onVentilatorCurrently','death', 1,2])
# iterate through pairs of columns, generate the correlations, plot
fig, axes = plt.subplots(2,3, figsize=(16, 9))
fig.tight_layout(pad=3.0)
for c in column_pairs:
# select the columns needed from dataframe
cols = [c[0]] + list(df_test2.columns[df_test2.columns.str.endswith(c[1])])
df_tmp = df_test2[cols]
# store the correlation, remove the index
df_corr = df_tmp.corr()[c[0]].round(4)
df_corr = df_corr.reset_index()
    df_corr.T  # transposed view of the correlation numbers (for inspection; result is not stored)
print('\n')
# add a days column
df_corr['days']= df_corr.index.values # The fourth row is t=1 day, shortcut
# plot the data
df_corr[3:].plot.line(ax=axes[c[2],c[3]], x="days", title=c[0]+' vs '+c[1]);
min_date: 2020-04-09 00:00:00 max_date: 2021-03-01 00:00:00 Days: 326
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 981 entries, (Timestamp('2020-04-09 00:00:00'), 'MI') to (Timestamp('2021-03-01 00:00:00'), 'OH')
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hospitalizedCurrently 930 non-null float64
1 inIcuCurrently 930 non-null float64
2 onVentilatorCurrently 930 non-null float64
3 deathIncrease 930 non-null float64
dtypes: float64(4)
memory usage: 44.4+ KB
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 930 entries, (Timestamp('2021-03-01 00:00:00'), 'NY') to (Timestamp('2020-05-02 00:00:00'), 'OH')
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hospitalizedCurrently 930 non-null float64
1 inIcuCurrently 930 non-null float64
2 onVentilatorCurrently 930 non-null float64
3 deathIncrease 930 non-null float64
dtypes: float64(4)
memory usage: 75.0+ KB
********************************************************************************
**************************************** CORRELATION SECTION ****************************************
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | hospitalizedCurrently | t1_ICU | t2_ICU | t3_ICU | t4_ICU | t5_ICU | t6_ICU | t7_ICU | t8_ICU | t9_ICU | t10_ICU | t11_ICU | t12_ICU | t13_ICU | t14_ICU |
| hospitalizedCurrently | 1.0 | 0.9387 | 0.9456 | 0.9514 | 0.9561 | 0.9595 | 0.9618 | 0.9629 | 0.9628 | 0.9615 | 0.959 | 0.9553 | 0.9504 | 0.9444 | 0.9374 |
<AxesSubplot:title={'center':'hospitalizedCurrently vs ICU'}, xlabel='days'>
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | hospitalizedCurrently | t1_vent | t2_vent | t3_vent | t4_vent | t5_vent | t6_vent | t7_vent | t8_vent | t9_vent | t10_vent | t11_vent | t12_vent | t13_vent | t14_vent |
| hospitalizedCurrently | 1.0 | 0.8673 | 0.881 | 0.8939 | 0.9057 | 0.9164 | 0.9259 | 0.9343 | 0.9415 | 0.9474 | 0.9519 | 0.9551 | 0.9567 | 0.957 | 0.9557 |
<AxesSubplot:title={'center':'hospitalizedCurrently vs vent'}, xlabel='days'>
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | hospitalizedCurrently | t1_death | t2_death | t3_death | t4_death | t5_death | t6_death | t7_death | t8_death | t9_death | t10_death | t11_death | t12_death | t13_death | t14_death |
| hospitalizedCurrently | 1.0 | 0.5828 | 0.5842 | 0.5853 | 0.5859 | 0.5862 | 0.5861 | 0.5858 | 0.5853 | 0.5845 | 0.5834 | 0.5821 | 0.5804 | 0.5784 | 0.5761 |
<AxesSubplot:title={'center':'hospitalizedCurrently vs death'}, xlabel='days'>
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | inIcuCurrently | t1_vent | t2_vent | t3_vent | t4_vent | t5_vent | t6_vent | t7_vent | t8_vent | t9_vent | t10_vent | t11_vent | t12_vent | t13_vent | t14_vent |
| inIcuCurrently | 1.0 | 0.9846 | 0.9892 | 0.9926 | 0.9947 | 0.9954 | 0.9948 | 0.9928 | 0.9894 | 0.9847 | 0.9786 | 0.971 | 0.9622 | 0.9522 | 0.941 |
<AxesSubplot:title={'center':'inIcuCurrently vs vent'}, xlabel='days'>
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | inIcuCurrently | t1_death | t2_death | t3_death | t4_death | t5_death | t6_death | t7_death | t8_death | t9_death | t10_death | t11_death | t12_death | t13_death | t14_death |
| inIcuCurrently | 1.0 | 0.5809 | 0.5761 | 0.5713 | 0.5665 | 0.5616 | 0.5567 | 0.5518 | 0.5467 | 0.5415 | 0.536 | 0.5307 | 0.525 | 0.5189 | 0.5127 |
<AxesSubplot:title={'center':'inIcuCurrently vs death'}, xlabel='days'>
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | onVentilatorCurrently | t1_death | t2_death | t3_death | t4_death | t5_death | t6_death | t7_death | t8_death | t9_death | t10_death | t11_death | t12_death | t13_death | t14_death |
| onVentilatorCurrently | 1.0 | 0.5567 | 0.5479 | 0.5393 | 0.5306 | 0.5222 | 0.5138 | 0.5057 | 0.4972 | 0.4889 | 0.4803 | 0.4719 | 0.4632 | 0.4539 | 0.4446 |
<AxesSubplot:title={'center':'onVentilatorCurrently vs death'}, xlabel='days'>
From the charts above, we can infer the following:
Over time, as hospital staff gained experience with the disease and better therapies evolved, the ratios of ventilator usage and deaths to total hospitalizations decreased.
# extract the columns needed
columns = ['date','state','hospitalizedCurrently','inIcuCurrently','onVentilatorCurrently','deathIncrease']
df_h3 = df_rolling[columns].copy()  # copy to avoid SettingWithCopyWarning when adding ratio columns
# add columns for ratio of hosp-ICU, hosp-vent, hosp-deaths, ICU-vent, ICU-deaths, vent-deaths
column_pairs = (['hospitalizedCurrently','inIcuCurrently', 'hosp_ICU', 0, 0],
['hospitalizedCurrently','onVentilatorCurrently', 'hosp_vent', 0, 1],
['hospitalizedCurrently','deathIncrease', 'hosp_deaths', 0, 2],
['inIcuCurrently','onVentilatorCurrently', 'ICU_vent', 1, 0],
['inIcuCurrently','deathIncrease', 'ICU_deaths', 1,1],
['onVentilatorCurrently','deathIncrease', 'vent_deaths', 1,2])
for c in column_pairs:
df_h3[c[2]] = (df_h3[c[1]] / df_h3[c[0]] )
df_h3 = df_h3.convert_dtypes()
df_h3.info()
# plot data
# iterate through pairs of columns, generate the correlations, plot
fig, axes = plt.subplots(2,3, figsize=(16, 9))
fig.tight_layout(pad=3.0)
for c in column_pairs:
df_tmp = df_h3[["date", "state", c[0],c[1], c[2]]]
df_tmp = df_tmp.set_index(["date", "state"])
df_tmp = df_tmp.dropna()
#sns.lineplot(df_tmp, ax=axes[c[3],c[4]])
#df_rolling.reset_index(inplace=True)
#sns.lineplot(data=df_tmp.dropna(), x=df_tmp[["date"]], y=df_tmp[["hosp_vent"]])
df_tmp.unstack(level=-1).plot.line(y=c[2],ax=axes[c[3],c[4]], title=c[2]);
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   date                   948 non-null    datetime64[ns]
 1   state                  948 non-null    string
 2   hospitalizedCurrently  930 non-null    Float64
 3   inIcuCurrently         930 non-null    Float64
 4   onVentilatorCurrently  930 non-null    Float64
 5   deathIncrease          930 non-null    Float64
 6   hosp_ICU               930 non-null    Float64
 7   hosp_vent              930 non-null    Float64
 8   hosp_deaths            930 non-null    Float64
 9   ICU_vent               930 non-null    Float64
 10  ICU_deaths             930 non-null    Float64
 11  vent_deaths            930 non-null    Float64
dtypes: Float64(10), datetime64[ns](1), string(1)
memory usage: 98.3 KB
<AxesSubplot:title={'center':'hosp_ICU'}, xlabel='date'>
<AxesSubplot:title={'center':'hosp_vent'}, xlabel='date'>
<AxesSubplot:title={'center':'hosp_deaths'}, xlabel='date'>
<AxesSubplot:title={'center':'ICU_vent'}, xlabel='date'>
<AxesSubplot:title={'center':'ICU_deaths'}, xlabel='date'>
<AxesSubplot:title={'center':'vent_deaths'}, xlabel='date'>
From the charts above it can be seen that the ratios of ICU usage, ventilator usage, and deaths relative to hospitalizations generally declined over the course of the pandemic, consistent with the inference above.